Getting Started

ANU BDSI
workshop
Introduction to R programming

Emi Tanaka

Biological Data Science Institute

3rd April 2024

Welcome 👋

Teaching team

Dr. Emi Tanaka

Helper TBD
  • Who are you?
    • What statistical software have you used before?
    • Introduce yourself to people around you

Workshop materials

All materials will be hosted at
https://anu-bdsi.github.io/workshop-intro-R/

Learning objectives

The main aim is for you to get started with using R for basic computations.

  • Conduct elementary arithmetic operations using R
  • Grasp the concept of missing values within the R environment
  • Compute basic summary statistics including mean, median, quartiles, and standard deviation using R
  • Install external packages in R to extend functionality
  • Manipulate lists, matrices, and vectors in R
  • Navigate the RStudio interactive development environment (IDE)
  • Import and export data in R
  • Comprehend various object types in R
  • Create basic functions, employ conditional statements, and utilize for loops in R
  • Decipher error messages and do basic troubleshooting

What is R?

  • R is a programming language used predominantly for data analysis
  • RStudio Desktop is an integrated development environment (IDE) that helps you to use R

How to use R?

  • RStudio Desktop (or RStudio IDE) is the most common way to use R

  • You can type operations directly into the Console pane
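
As a minimal sketch of what typing operations into the Console looks like, the lines below can be entered one at a time. The variable name `result` is only an illustrative choice.

```r
# Elementary arithmetic, typed directly at the Console prompt
1 + 2      # addition: 3
7 - 3      # subtraction: 4
4 * 2.5    # multiplication: 10
10 / 4     # division: 2.5
2^3        # exponentiation: 8
7 %/% 2    # integer division: 3
7 %% 2     # remainder (modulo): 1

# Store a value with the assignment operator <- and print it
result <- (1 + 2) * 3
result     # 9
```

Typing the name of an object and pressing Enter prints its value.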

Live demo

Customise Global Options

  • Go to RStudio > Tools > Global Options…
  • Under the General tab, make sure “Restore .RData into workspace at startup” is unticked.
  • This avoids unexpectedly loading (old) data into your workspace, which can make your code work only in your workspace and not for others (poor practice for reproducibility).

R Packages

  • R packages are community-developed extensions to R (much like apps on your mobile)
  • The Comprehensive R Archive Network (CRAN) is a volunteer-maintained repository that hosts submitted R packages that are approved (much like an app store)
    • There are close to 20,000 packages available on CRAN
    • The quality of R packages varies
  • There are other repositories that host R packages, e.g. Bioconductor for bioinformatics, R Universe, R-Forge, GitHub (we won’t cover these)
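
A minimal sketch of installing and loading a package from CRAN. The package name "ggplot2" is only an illustrative choice and is not required here. The install line is commented out because it needs an internet connection and only has to be run once.

```r
# Install a package from CRAN (run once; needs an internet connection).
# "ggplot2" is an illustrative example, not a requirement of this workshop.
# install.packages("ggplot2")

# Check whether a package is installed, without attaching it.
# "stats" ships with R, so this should be TRUE.
has_stats <- requireNamespace("stats", quietly = TRUE)
has_stats

# Attach an installed package so its functions can be called by name.
library(stats)
```

Installing is done once per machine, whereas `library()` is called in every new R session that uses the package.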

Photo by Sara Kurfeß on Unsplash

Why learn R?

  • R is one of the top programming languages for statistics or data science
    • Python is also a good alternative language for data science
    • Better to have a mastery of at least one language rather than none
  • R was initially developed by statisticians for statisticians
    • State-of-the-art statistical methods are generally more readily available in R
  • R has an active and friendly community
  • R is free and open source software (FOSS)
    • free = money is not a barrier to use it
    • open source software = transparency

How to get better at R?

  • PRACTICE
  • Practice with a purpose (e.g. using R on your own data)
  • Try teaching and helping others with their R problems
  • Have a willingness to continuously learn and adapt
    • R is an ever-evolving language (check the release news every so often)
    • new features and packages are added very frequently
    • whether you are a beginner or not, there are always things we do not know about R
  • Do you have any strategies or tips? Please share!

Functions in R

  • There are many functions in R!
  • Generally if you need to compute some numerical summary that is common in your field, then there is probably already an existing function in the R ecosystem.
  • Try searching for it with a search engine (e.g. Google) using the right keywords.
  • If the function comes from a community-contributed R package, then check whether there are some quality indicators:
    • Is it actively maintained?
    • Is it widely used?
    • Does the package have tests for its functions? Etc.
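
Besides a search engine, R itself can help locate functions and inspect a package. The following is a small sketch using `median` and the stats package purely as examples.

```r
# List objects on the search path whose names contain "median"
matches <- apropos("median")
matches

# Find which attached package a function comes from
where <- find("median")
where                        # "package:stats"

# Inspect package metadata, e.g. its version
pkg_version <- packageDescription("stats", fields = "Version")
pkg_version

# In an interactive session, these open the help pages:
# ?median        # help for a known function name
# ??"median"     # keyword search across installed packages
```

For a community-contributed package, the same `packageDescription()` call shows fields such as the maintainer and date, which can hint at how actively it is maintained.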

Base packages

  • R has 7 packages (stats, graphics, grDevices, utils, datasets, methods, base), collectively referred to as the “base packages”, that are loaded automatically when you launch it.
  • The functions in the base packages are generally well-tested and trustworthy.
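
One way to check which of these packages are attached in the current session, sketched under the assumption that R was started with its default settings:

```r
# The seven packages named above
base_pkgs <- c("stats", "graphics", "grDevices", "utils",
               "datasets", "methods", "base")

# search() lists what is currently attached, e.g. "package:stats"
attached <- search()
is_attached <- paste0("package:", base_pkgs) %in% attached
names(is_attached) <- base_pkgs
is_attached

# See which package a function belongs to
environmentName(environment(median))   # "stats"
environmentName(environment(mean))     # "base"
```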

Arithmetic

  • Many of the arithmetic functions come from base.
  • Run library(help = "base") to see an index of its help files.
sqrt(3)
[1] 1.732051
abs(-3)
[1] 3
exp(1)
[1] 2.718282
log(4, base = exp(1))
[1] 1.386294
sum(1:3)
[1] 6
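
A few related calls, as a small sketch: `log()` uses the natural logarithm by default, so the `base = exp(1)` argument above is optional. The variable names are illustrative only.

```r
# log() defaults to the natural logarithm
natural  <- log(4)
explicit <- log(4, base = exp(1))
all.equal(natural, explicit)   # TRUE

# Other common bases have their own functions
log10(1000)                    # 3
log2(8)                        # 3

# Round a result to a chosen number of decimal places
rounded <- round(exp(1), 2)
rounded                        # 2.72
```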

Numerical summaries

  • Functions for numerical summaries generally come from the base or stats package.
  • Some common numerical summaries include:
    • Mean: mean()
    • Median: median()
    • Five number summary: fivenum()
    • Minimum: min()
    • Maximum: max()
    • Quantile: quantile()
    • Correlation coefficient: cor()
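
The functions listed above can be tried on a small made-up vector. The data here are invented for illustration. `sd()` (standard deviation, from stats) is included as well, although it is not in the list above.

```r
# Made-up example data
x <- c(2, 4, 4, 5, 7, 9, 11)
y <- 2 * x + 1               # perfectly linearly related to x

mean(x)                      # 6
median(x)                    # 5
min(x)                       # 2
max(x)                       # 11
fivenum(x)                   # 2 4 5 8 11
quantile(x, probs = c(0.25, 0.75))   # 25%: 4, 75%: 8
sd(x)                        # sqrt(10), about 3.162278
cor(x, y)                    # 1
```

Note that `fivenum()` and `quantile()` use different definitions of the quartiles, so they do not always agree as they happen to here.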

Missing values

  • NA in R denotes missing values – there are in fact different types of missing values (NA_character_, NA_integer_, NA_real_, NA_complex_).
  • Missing values can cause issues in computations.
x <- c(2.3, NA, 4.7)
mean(x)
[1] NA
  • Below we remove the missing values:
mean(x, na.rm = TRUE)
[1] 3.5
  • Notice that this differs from the following when there are missing value(s), because length(x) still counts the missing value:
sum(x, na.rm = TRUE) / length(x)
[1] 2.333333
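
A short sketch that checks the NA types mentioned above and shows how to count only the non-missing values with `is.na()`. The names `n_obs` and `manual_mean` are illustrative.

```r
# Each missing value type has its own underlying type
typeof(NA)             # "logical"
typeof(NA_integer_)    # "integer"
typeof(NA_real_)       # "double"
typeof(NA_character_)  # "character"
typeof(NA_complex_)    # "complex"

x <- c(2.3, NA, 4.7)

# is.na() flags which elements are missing
is.na(x)               # FALSE TRUE FALSE

# Dividing by the number of non-missing values recovers the mean
n_obs <- sum(!is.na(x))                    # 2
manual_mean <- sum(x, na.rm = TRUE) / n_obs
manual_mean                                # 3.5, same as mean(x, na.rm = TRUE)
```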

Some parametric distributions

  • The density (d), distribution (p) and quantile (q) functions of a parametric distribution are generally in the stats package.

  • There are functions to generate random values from a particular parametric distribution (r).

  • Some examples are:

Normal distribution

  • dnorm()
  • pnorm()
  • qnorm()
  • rnorm()

t-distribution

  • dt()
  • pt()
  • qt()
  • rt()

Poisson distribution

  • dpois()
  • ppois()
  • qpois()
  • rpois()

F distribution

  • df()
  • pf()
  • qf()
  • rf()
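
The d/p/q/r naming convention can be tried out as follows. This is a minimal sketch; the seed value and the parameter values are arbitrary choices for illustration.

```r
# Density of the standard normal at 0
dnorm(0)                       # about 0.3989423

# Distribution function: P(Z <= 1.96)
pnorm(1.96)                    # about 0.9750021

# Quantile function: the inverse of pnorm()
q <- qnorm(0.975)
q                              # about 1.959964
pnorm(q)                       # back to 0.975

# Random draws; set.seed() makes them repeatable
set.seed(1)                    # arbitrary seed
draws <- rnorm(3, mean = 0, sd = 1)
draws

# The same pattern applies to the other distributions
dpois(0, lambda = 1)           # exp(-1), about 0.3678794
qt(0.975, df = 10)             # about 2.228139
```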